SeqProp - Protein Sequence Properties

This notebook gives an overview of the available calculations for properties of a single protein sequence.

Input: Amino acid sequence
Output: Amino acid sequence properties

Note

See ssbio.protein.sequence.seqprop.SeqProp for a description of all the available attributes and functions.

Imports

In [ ]:
import sys
import logging
import os.path as op
In [ ]:
# Import the SeqProp class
from ssbio.protein.sequence.seqprop import SeqProp
In [ ]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Logging

Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to control how verbose the pipeline is. DEBUG is the most verbose level.

  • CRITICAL
    • Only really important messages shown
  • ERROR
    • Major errors
  • WARNING
    • Warnings that don’t affect running of the pipeline
  • INFO (default)
    • Info such as the number of structures mapped per gene
  • DEBUG
    • Really detailed information that will print out a lot of stuff
In [ ]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #
In [ ]:
# Route log messages to stderr with a timestamped format so they display in Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]
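
The effect of the level threshold described above can be checked with a small standalone sketch. It uses a separate logger named seqprop_demo (a hypothetical name chosen for this illustration) and an in-memory buffer, so it does not interfere with the root logger configured in the cells above.

```python
import io
import logging

# Hypothetical logger name, used only for this illustration
demo_logger = logging.getLogger("seqprop_demo")
demo_logger.propagate = False  # keep demo messages away from the root logger

# Capture output in memory so the effect of the level is easy to inspect
buffer = io.StringIO()
demo_handler = logging.StreamHandler(buffer)
demo_handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
demo_logger.handlers = [demo_handler]

# With the threshold at WARNING, DEBUG and INFO messages are dropped
demo_logger.setLevel(logging.WARNING)
demo_logger.debug("detailed diagnostic message")
demo_logger.info("informational message")
demo_logger.warning("warning message")
demo_logger.error("error message")

captured = buffer.getvalue().splitlines()
print(captured)
```

Only the WARNING and ERROR messages reach the buffer; lowering the level to logging.DEBUG would let all four through.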

Initialization of the project

Set these two things:

  • PROTEIN_ID
    • Your protein ID
  • PROTEIN_SEQ
    • Your protein sequence
In [ ]:
# SET IDS HERE
PROTEIN_ID = 'YIAJ_ECOLI'
PROTEIN_SEQ = 'MGKEVMGKKENEMAQEKERPAGSQSLFRGLMLIEILSNYPNGCPLAHLSELAGLNKSTVHRLLQGLQSCGYVTTAPAAGSYRLTTKFIAVGQKALSSLNIIHIAAPHLEALNIATGETINFSSREDDHAILIYKLEPTTGMLRTRAYIGQHMPLYCSAMGKIYMAFGHPDYVKSYWESHQHEIQPLTRNTITELPAMFDELAHIRESGAAMDREENELGVSCIAVPVFDIHGRVPYAVSISLSTSRLKQVGEKNLLKPLRETAQAISNELGFTVRDDLGAIT'
In [ ]:
# Create the SeqProp object
my_seq = SeqProp(id=PROTEIN_ID, seq=PROTEIN_SEQ)

SeqProp.write_fasta_file(outfile, force_rerun=False)

Write a FASTA file for the protein sequence; seq will now load directly from this file.

Parameters:
  • outfile (str) – Path to new FASTA file to be written to
  • force_rerun (bool) – Whether an existing file should be overwritten
In [ ]:
# Write temporary FASTA file for property calculations that require FASTA file as input
import tempfile
ROOT_DIR = tempfile.gettempdir()

my_seq.write_fasta_file(outfile=op.join(ROOT_DIR, 'tmp.fasta'), force_rerun=True)
my_seq.sequence_path
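
For readers unfamiliar with the FASTA format, the following is a minimal standard-library sketch of what writing and re-reading a single-record FASTA file involves. The helpers write_fasta and read_fasta, the 60-character line width, and the DEMO_PROTEIN identifier are assumptions made for this illustration; this is not how ssbio implements write_fasta_file.

```python
import os.path as op
import tempfile


def write_fasta(seq_id, seq, outfile, width=60):
    """Write one sequence to a FASTA file, wrapping residues at `width` characters."""
    with open(outfile, "w") as handle:
        handle.write(">{}\n".format(seq_id))  # header line starts with '>'
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + "\n")
    return outfile


def read_fasta(infile):
    """Read a single-record FASTA file back as an (id, sequence) tuple."""
    with open(infile) as handle:
        lines = [line.strip() for line in handle if line.strip()]
    return lines[0][1:], "".join(lines[1:])


# Hypothetical ID and a short sequence, used only for this demo
original_seq = "MGKEVMGKKENEMAQEKERPAGSQSLFRGLMLIEILSNYPNGCPLAHLSELAGLNKSTVHRLLQGLQSC"
demo_path = op.join(tempfile.mkdtemp(), "demo.fasta")

write_fasta("DEMO_PROTEIN", original_seq, demo_path)
demo_id, demo_seq = read_fasta(demo_path)
print(demo_id, len(demo_seq))
```

The round trip recovers the same identifier and sequence, which is the property that lets the sequence be loaded from the file rather than held in memory.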

Computing and storing protein properties

A SeqProp object is simply an extension of the Biopython SeqRecord object. Global properties which describe or summarize the entire protein sequence are stored in the annotations attribute, while local residue-specific properties are stored in the letter_annotations attribute.
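
The distinction between the two containers can be sketched with plain dictionaries. This is a stand-in only, not the SeqRecord or SeqProp API: the names global_props and per_residue_props, the key names, and the simplified hydrophobic residue set are all assumptions for illustration. The one rule carried over from Biopython is that every per-residue entry must have the same length as the sequence.

```python
# Short demo sequence (hypothetical, for illustration only)
demo_seq = "MGKEVMGKKE"

# Global properties: one value that summarizes the whole sequence
global_props = {"length": len(demo_seq)}

# Local properties: one value per residue, in sequence order.
# The hydrophobic set below is a simplified assumption, not a standard scale.
hydrophobic_residues = set("AVILMFWC")
per_residue_props = {
    "is_hydrophobic": [aa in hydrophobic_residues for aa in demo_seq],
}

# Each per-residue entry must line up with the sequence, position by position
lengths_match = all(len(v) == len(demo_seq) for v in per_residue_props.values())
print(global_props, lengths_match)
```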

Basic global properties

SeqProp.get_biopython_pepstats()

Run Biopython’s built-in ProteinAnalysis module and store statistics in the annotations attribute.

In [ ]:
# Global properties using the Biopython ProteinAnalysis module
my_seq.get_biopython_pepstats()
{k:v for k,v in my_seq.annotations.items() if k.endswith('-biop')}
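
The cell above selects results by key suffix. The sketch below shows that pattern on a plain dictionary, using a simple amino acid composition as the stored statistic. The -demo suffix, the key names, and the short sequence are assumptions for this illustration; the actual keys written by get_biopython_pepstats may differ.

```python
from collections import Counter

# Short demo sequence (hypothetical, for illustration only)
demo_seq = "MGKEVMGKKE"
counts = Counter(demo_seq)

# Stand-in for my_seq.annotations, with one unrelated pre-existing entry
annotations = {"source": "example"}

# Store each statistic under a key tagged with a method suffix
annotations["amino_acids_count-demo"] = dict(counts)
annotations["fraction_G-demo"] = counts["G"] / len(demo_seq)

# Select only the entries produced by the '-demo' method
demo_stats = {k: v for k, v in annotations.items() if k.endswith("-demo")}
print(demo_stats)
```

Tagging keys with the method that produced them keeps results from different tools side by side in one dictionary without name collisions.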

SeqProp.get_emboss_pepstats()

Run the EMBOSS pepstats program on the protein sequence.

Stores statistics in the annotations attribute. Saves a .pepstats file of the results where the sequence file is located.

In [ ]:
# Global properties from the EMBOSS pepstats program
my_seq.get_emboss_pepstats()
{k:v for k,v in my_seq.annotations.items() if k.endswith('-pepstats')}

SeqProp.get_aggregation_propensity(email, password, cutoff_v=5, cutoff_n=5, run_amylmuts=False, outdir=None)

Run the AMYLPRED2 web server to calculate the aggregation propensity of this protein sequence, which is the number of aggregation-prone segments on the unfolded protein sequence.

Stores statistics in the annotations attribute, under the key aggprop-amylpred.

See ssbio.protein.sequence.properties.aggregation_propensity for instructions and details.

In [ ]:
# Aggregation propensity - the predicted number of aggregation-prone segments on an unfolded protein sequence
my_seq.get_aggregation_propensity(outdir=ROOT_DIR, email='nmih@ucsd.edu', password='ssbiotest', cutoff_v=5, cutoff_n=5, run_amylmuts=False)
{k:v for k,v in my_seq.annotations.items() if k.endswith('-amylpred')}

SeqProp.get_kinetic_folding_rate(secstruct, at_temp=None)

Run the FOLD-RATE web server to calculate the kinetic folding rate given an amino acid sequence and its structural classification (alpha/beta/mixed).

Stores statistics in the annotations attribute, under the key kinetic_folding_rate_<TEMP>-foldrate.

See ssbio.protein.sequence.properties.kinetic_folding_rate.get_foldrate() for instructions and details.

In [ ]:
# Kinetic folding rate - the predicted rate of folding for this protein sequence
secstruct_class = 'mixed'
my_seq.get_kinetic_folding_rate(secstruct=secstruct_class)
{k:v for k,v in my_seq.annotations.items() if k.endswith('-foldrate')}

SeqProp.get_thermostability(at_temp)

Run the thermostability calculator using either the Dill or Oobatake methods.

Stores calculated (dG, Keq) tuple in the annotations attribute, under the key thermostability_<TEMP>-<METHOD_USED>.

See ssbio.protein.sequence.properties.thermostability.get_dG_at_T() for instructions and details.

In [ ]:
# Thermostability - prediction of free energy of unfolding dG from protein sequence
# Stores (dG, Keq)
my_seq.get_thermostability(at_temp=32.0)
my_seq.get_thermostability(at_temp=37.0)
my_seq.get_thermostability(at_temp=42.0)
{k:v for k,v in my_seq.annotations.items() if k.startswith('thermostability_')}
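
Since each stored value is a (dG, Keq) pair, it may help to see how an equilibrium constant follows from a free energy under the standard thermodynamic relation Keq = exp(-dG / (R T)). The sketch below assumes dG is in cal/mol and the temperature is in degrees Celsius; the function name keq_from_dg and the 5000 cal/mol input are hypothetical, and the units and sign convention ssbio actually uses should be checked in ssbio.protein.sequence.properties.thermostability.

```python
import math

# Gas constant in cal/(mol*K); assumes dG is expressed in cal/mol
R_CAL = 1.987


def keq_from_dg(dg_cal_per_mol, temp_celsius):
    """Equilibrium constant from a free energy change via Keq = exp(-dG / (R*T))."""
    temp_kelvin = temp_celsius + 273.15
    return math.exp(-dg_cal_per_mol / (R_CAL * temp_kelvin))


# Hypothetical dG value, evaluated at the same temperatures as the cell above
example_dg = 5000.0
for temp in (32.0, 37.0, 42.0):
    print(temp, keq_from_dg(example_dg, temp))
```

With a positive dG held fixed, Keq stays below 1 and rises with temperature. In a real calculation dG itself also depends on temperature, which is why the cell above evaluates the property separately at each temperature.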